Support RL online quantization with torchao #23014
Conversation
Code Review
This pull request introduces a mechanism to initialize TorchAOConfig from a file, which is a great step towards enabling on-the-fly quantization. The changes span configuration, the torchao quantization layer, and weight loading utilities. While the overall direction is good, I've identified a few critical and high-severity issues. There's a significant logic bug in weight_utils.py that seems to prevent the feature from working on models that are not already quantized. Another critical issue is that dummy weight initialization for profiling has been commented out, which will likely break profiling runs. Additionally, I've pointed out a couple of high-severity issues in the new torchao code related to a potential TypeError from an unsafe method signature and a hardcoded dtype marked as a "temp hack". I've provided specific suggestions to address each of these points.
Summary: Only supporting quantizing all linear layers with a torchao config for now. See the vLLM PR for how to generate the quantization file. Also requires vLLM changes: vllm-project/vllm#23014. Test Plan: sh examples/ppo_trainer/run_deepseek7b_llm.sh Reviewers: Subscribers: Tasks: Tags:
Waiting on verl to confirm the API changes make sense first, before cleaning up this PR for review.
Can't repro the quantization test timeout locally; rebasing and running the tests again to see if it persists.
OK, the quantization tests passed. The Language Model Tests are failing, but I don't think they are related to these changes; I also saw the Language Model Tests failing on main: https://buildkite.com/vllm/ci/builds/33089/steps/canvas. I think it's safe to merge now.
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
**Summary:** Existing support for `load_in_fp8=True` performs an offline quantization when loading the initial model. This is no longer necessary as of vllm==0.12.0 (after vllm-project/vllm#23014), where we can quantize the model on the fly when we load it:

```python
llm = LLM(
    ...
    hf_overrides={
        "quantization_config_dict_str": json.dumps(torchao_config),
    },
)
```

**Test Plan:** https://gist.github.com/andrewor14/5b85119fae46845d07b608d420907423
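To make the snippet above concrete, here is a minimal sketch of assembling the override before constructing the `LLM`. The contents of `torchao_config` are an illustrative assumption (the real dict comes from serializing a torchao config object, so these field names may differ); the vLLM call itself is left as a comment to keep the sketch self-contained:

```python
import json

# Illustrative torchao config dict; the actual schema is produced by
# torchao's config serialization, so these field names are assumptions.
torchao_config = {
    "_type": "Float8DynamicActivationFloat8WeightConfig",
    "_version": 1,
    "_data": {},
}

# vLLM reads the quantization config from this HF override as a JSON string.
hf_overrides = {
    "quantization_config_dict_str": json.dumps(torchao_config),
}

# Then pass it to vLLM (requires vllm>=0.12.0):
# llm = LLM(model=..., hf_overrides=hf_overrides)
```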
Summary:
This is to enable online quant for verl. The PR adds support for initializing a `TorchAOConfig` object in vLLM through a serialized JSON file that specifies the type of quantization people want, or through a JSON-serialized `TorchAOConfig` object.
Code for serializing the config to JSON:
Code for serializing the config to a file:
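As a rough sketch of both steps (the `quant_config` dict is an illustrative assumption — the real schema is whatever torchao's config serialization emits, not these exact field names):

```python
import json
import tempfile

# Illustrative serialized quantization config; the actual schema is
# defined by torchao's config serialization, these fields are assumptions.
quant_config = {
    "_type": "Float8DynamicActivationFloat8WeightConfig",
    "_version": 1,
    "_data": {},
}

# Serialize to a JSON string (usable for quantization_config_dict_str).
config_str = json.dumps(quant_config)

# Or write it out as a file that vLLM can be pointed at.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write(config_str)
    config_path = f.name
```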
This also supports module-level config through the `ModuleFqnToConfig` config (https://huggingface.co/docs/transformers/main/en/quantization/torchao#per-module-quantization), although this is not tested yet.
More configs: https://docs.pytorch.org/ao/main/api_ref_quantization.html#inference-apis-for-quantize
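To sketch the per-module idea: a module-FQN-to-config mapping pairs fully qualified module names with per-module configs, with a default entry as a fallback. The JSON below is an illustrative assumption of that shape, not torchao's exact serialized format:

```python
import json

# Illustrative module-FQN -> config mapping in the spirit of torchao's
# ModuleFqnToConfig; the exact serialized schema is an assumption here.
module_fqn_to_config = {
    # "_default" applies to modules without a more specific entry.
    "_default": {"_type": "Int8WeightOnlyConfig", "_version": 1, "_data": {}},
    # A specific layer can get its own config.
    "model.layers.0.self_attn.q_proj": {
        "_type": "Float8DynamicActivationFloat8WeightConfig",
        "_version": 1,
        "_data": {},
    },
}

serialized = json.dumps({"module_fqn_to_config": module_fqn_to_config})
loaded = json.loads(serialized)
```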
Note: this has incorporated changes from @LiyuanLucasLiu's PR #23901. The vLLM fp8 quant method is not supported yet; we can add that in a separate PR.
Test Plan:
pytest tests/quantization/test_torchao.py -k test_on_the_fly_quant
pytest tests/quantization/test_torchao.py -k test_reload_weights
and regression tests:
pytest tests/quantization/test_torchao.py
Reviewers:
Subscribers:
Tasks:
Tags: